This practice is part of the subject Biomedical Data Science of the Degree in Data Science of the Universitat Politècnica de València, and taught by the Department of Applied Physics.
The measurement of data quality dimensions (DQ) is the central axis for the evaluation and improvement of data quality as well as for its correct and optimal use. Data quality dimensions are individual aspects or constructs that represent data quality attributes. To these can be associated one or more metrics, quantified with specific methods, as well as exploratory methods.
This practice is intended to provide an initial basis for the evaluation of DQ metrics. It will consist of the application of a series of methods for different dimensions of DQ. In the context of the maternal and child clinical setting, we will analyze a data file whose ultimate purpose is the monitoring of care indicators in this setting. Depending on the dimension, we will apply the methods and calculate the metrics both in general for the whole dataset and monitored over batches of time (by months), simulating the results of a DQ monitoring and continuous improvement system.
In some parts of the code we will find the text ##TODO## that we will need to complete. Additionally, we will have to discuss the results in those points where it is indicated. The deliverable of the practice will consist of the compilation in html of this R markdown file, using Knit, where the results of the execution and figures will be observed, and having completed the ##TODO## and commented the results.
We check that the working directory is in the one where we have the practice file and the folder with the data:
getwd()
Otherwise, we set it (change the example directory to ours):
setwd("D:/Users/Vartotojas/Documents/GitHub/BiomedicalDataScience_LAB05")
We install the required libraries and then load them into the working environment.
install.packages("zoo", repos = "http://cran.us.r-project.org")
install.packages("rts", repos = "http://cran.us.r-project.org")
install.packages("plotly", repos = "http://cran.us.r-project.org")
install.packages("devtools", repos = "http://cran.us.r-project.org")
library("devtools")
devtools::install_github('c5sire/datacheck')
devtools::install_github("hms-dbmi/EHRtemporalVariability")
library("zoo")
library("rts")
library("plotly")
library("datacheck")
library("EHRtemporalVariability")
We set the initial parameters of the data. The main date of the records, which will be used for the purpose of monitoring the delivery care indicators, is the date of birth.
# File name
fileName = "data/DQIinfantFeeding.csv"
# Whether it has a header or not
hasHeader = TRUE
# Main date column to be used for monitoring purposes
dateColumn = 'BIRTH_DATE'
# Format of the previous date
dateFormat = '%d/%m/%Y'
# Which text string will represent missing data
missingValue = NA # Currently, there are no missing values in the date column
We load the file data/DQIinfantFeeding.csv in a data.frame named repository:
repository <- read.csv2(fileName, header=hasHeader, na.strings=missingValue)
# We collect the number of rows and columns
N <- nrow(repository)
D <- ncol(repository)
For monitoring purposes, we will use the zoo library (S3 Infrastructure for Regular and Irregular Time Series - Z’s Ordered Observations) to convert the data, the data.frame, to a format suited for batch analyses, the zoo format.
zooRepository <- read.zoo(repository,format = dateFormat,index.column = dateColumn)
One of the main uses of the maternal and infant data repository studied is the monitoring of quality of care indicators. In the field of newborn feeding, one of the most important indicators is whether there has been early initiation of breastfeeding in the delivery room. To calculate this indicator, we create the following function that will obtain the indicator for each batch of data received, so that we can apply it repeatedly for each batch given a frequency.
indicatorEBF_delroom <- function(dataset){
numerator = (dataset$NEO_MOMFBF_TYPE %in% 'During the 1st hour') &
(dataset$NEO_PFBF_TYPE %in% 'Delivery room') &
(dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
denominator = (dataset$NEO_MOMFBF_TYPE %in% c('During the 1st hour', 'During the 2nd hour', 'During the 3rd hour','Breastfeeding does not start')) &
(dataset$NEO_PFBF_TYPE %in% c('Delivery room', 'Hospitalization unit', 'Breastfeeding does not start')) &
!(dataset$NEO_FBFEVAL_TY %in% 'Undesired breastfeeding') &
(dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
indicator = sum(numerator)/sum(denominator) * 100
return(indicator)
}
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts (Raster Time Series Analysis) library. In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resIndicatorES2SC_delroom =apply.monthly(zooRepository, FUN=indicatorEBF_delroom)
plot(resIndicatorES2SC_delroom,xlab = "Date", ylab ="%",main = "Early breastfeeding start in the delivery room", ylim=c(0,100))
📝 DISCUSS RESULTS
When plotting the function by monthly data, we get a time series plot, which indicates the percentage of cases with early breastfeeding for the given time period.
We can see that a large chunk of data between mid-2009 and early 2011 is missing. We can also find that a majority of the graph lies between 80%, meaning that \(\frac{4}{5}\) of cases show early breastfeeding start in the delivery room.
We will find the missing data in the repository and calculate the corresponding metrics. First, for each variable:
NAmatrix <- !is.na(repository)
sumNAmatrix <- apply(NAmatrix,2,sum)
completenessByColumn <- round(sumNAmatrix/N*100,2)
completenessByColumn
## PATIENT_ID DEL_TYPE_A DEL_MODE DEL_TYPE_ANE DEL_IN_INTRACA
## 100.00 100.00 100.00 100.00 100.00
## BIRTH_DATE NEO_ES2SC_TYPE NEO_ES2SCP_TYPE NEO_MOMFBF_TYPE NEO_PFBF_TYPE
## 100.00 100.00 100.00 100.00 100.00
## NEO_FBFEVAL_TY NEO_APGAR_NUM NEO_WEIGHT_NUM NEO_PHCORD_TXT NEO_BF_DIS
## 100.00 31.31 95.42 100.00 100.00
Next, we will calculate and display the overall percentage of missing data:
completenessByDataset = sum(!is.na(repository))/(N*D) * 100
100 - completenessByDataset
## [1] 4.884679
📝 DISCUSS RESULTS
Applying the completenessByColumn function, we find that
almost \(69\%\) of NEO_APGAR_NUM column
data is missing and \(4.58\%\) of
NEO_WEIGHT_NUM column data is also missing. These two columns overall
conclude, that we have \(4.88\%\)
missing values in our data set, the majority being in the NEO_APGAR_NUM
column.
To monitor the completeness by temporary batches we will create a function that does this calculation for each batch it receives as a parameter, and returns the completeness for each column, the function dimCompletessByColumn:
dimCompletenessByColumn <- function(repository){
N = dim(repository)[1]
NAmatrix <- !is.na(repository)
sumNAmatrix <- apply(NAmatrix,2,sum)
completenessByColumn <- round(sumNAmatrix/N*100,2)
return(completenessByColumn)
}
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts (Raster Time Series Analysis) library. In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resCompletenessByColumn = apply.monthly(zooRepository, FUN=dimCompletenessByColumn
)
Now, we can create a plot with the results using the plotly library (Create Interactive Web Graphics via ‘plotly.js’). First for each variable:
p <-
plot_ly(
x = index(resCompletenessByColumn),
y = resCompletenessByColumn[, 1],
name = names(resCompletenessByColumn)[1],
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Completeness by month',
xaxis = list(title = "Date"),
yaxis = list (title = "Completeness (%)")
)
for (i in 2:ncol(resCompletenessByColumn)) {
p = p %>% plotly::add_trace(y = resCompletenessByColumn[, i],
name = names(resCompletenessByColumn)[i],
mode = 'lines')
}
p
And secondly globally, being able to calculate the result from the variable resCompletenessByColumn and use the code to plot a single series from the previous code chunk:
dimCompletenessByDataset <- function(repository){
N = dim(repository)[1]
D = dim(repository)[2]
completenessByDataset = sum(!is.na(repository))/(N*D) * 100
return(completenessByDataset)
}
resCompletenessByDataset = apply.monthly(zooRepository, FUN=dimCompletenessByDataset
)
resCompletenessByDataset
## 2009-01-31 2009-02-28 2009-03-31 2009-04-30 2009-05-30 2009-06-30 2009-07-31
## 96.71053 94.92063 94.20849 93.83754 93.87755 93.44423 95.08197
## 2009-08-31 2009-09-30 2009-10-31 2009-11-30 2009-12-31 2010-01-31 2010-02-27
## 94.12442 95.12472 94.37690 94.57672 93.17227 94.19643 94.80519
## 2010-03-31 2010-04-30 2010-05-31 2010-06-30 2010-07-31 2010-08-31 2010-09-30
## 94.47005 94.50549 93.86318 95.20089 93.47826 95.01348 96.27329
## 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30
## 94.73684 94.51346 94.97354 95.31616 96.20061 96.27329 95.11278
## 2011-05-31 2011-06-30 2011-07-31 2011-08-31 2011-09-30 2011-10-31 2011-11-30
## 95.39171 94.26020 96.04592 94.16910 95.39683 94.81793 93.30357
## 2011-12-31 2012-01-31 2012-02-29 2012-03-31 2012-04-28 2012-05-31 2012-06-30
## 94.79167 94.58525 95.89947 95.23810 96.42857 94.49405 94.87179
## 2012-07-29 2012-08-31 2012-09-30 2012-10-31 2012-11-29 2012-12-31 2013-01-31
## 95.77259 94.92754 95.34161 95.85714 95.83333 96.56593 95.18950
## 2013-02-28 2013-03-31 2013-04-30 2013-05-31 2013-06-30 2013-07-31 2013-08-31
## 95.80745 95.48872 94.59459 93.98496 93.92857 94.64286 94.04762
## 2013-09-30 2013-10-31 2013-11-29 2013-12-31 2014-01-31 2014-02-28 2014-03-29
## 95.47038 95.84718 94.64286 95.18272 94.74394 95.55256 94.80519
## 2014-04-29 2014-05-31 2014-06-30 2014-07-31 2014-08-31 2014-09-30 2014-10-30
## 95.51495 95.08929 94.64286 94.36813 94.00000 93.59606 92.85714
## 2014-11-27 2014-12-31
## 92.17687 91.22024
p <-
plot_ly(
x = index(resCompletenessByDataset),
y = resCompletenessByDataset,
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Global completeness by month',
xaxis = list(title = "Date"),
yaxis = list (title = "Completeness (%)")
)
p
📝 DISCUSS RESULTS
…
We are going to analyze two multivariate consistency rules in the data. For this we will use the datacheck library (Tools for Checking Data Consistency), which allows us to write logical rules in R language, using the variable names of our data. These rules will be written in the attached file rules.R, the first one is provided as an example, and the second one should be coded based on the provided natural language expression rule.
# We read the rules file
rules = read_rules("rules.R")
# We evaluate the rules on our data
profile = datadict_profile(repository, rules)
# We show the account of errors for each rule
knitr::kable(profile$checks[,c(1,3,6)])
| Variable | Rule | Error.sum |
|---|---|---|
| DEL_MODE | !(sapply(DEL_MODE, tolower) %in% “elective caesarea”) | (sapply(DEL_TYPE_ANE, tolower) %in% c(“epidural anesthesia”, “epidural anesthesia / general anesthesia”, “spinal anesthesia”, “spinal anesthesia / general anesthesia”, “general anesthesia”)) | 220 |
# We list the cases that have been marked as inconsistent for each rule
knitr::kable(profile$checks[,c(1,7)])
| Variable | Error.list |
|---|---|
| DEL_MODE | 472,500,525,583,636,638,654,683,698,699,729,793,802,811,821,838,846,865,866,890,918,969,975,985,987,1008,1010,1013,1066,1076,1082,1129,1150,1190,1194,1241,1246,1270,1300,1367,1383,1384,1388,1402,1411,1428,1433,1445,1446,1453,1459,1501,1508,1531,1559,1575,1578,1612,1616,1635,1665,1667,1669,1674,1685,1692,1719,1753,1770,1774,1781,1784,1787,1790,1791,1800,1802,1836,1839,1840,1855,1857,1861,1874,1883,1902,1925,1928,1933,1939,1950,1963,1970,1972,1984,1987,2000,2001,2010,2049,2055,2058,2066,2102,2122,2134,2135,2145,2169,2251,2278,2292,2322,2334,2348,2368,2369,2386,2401,2409,2457,2474,2493,2518,2530,2550,2551,2563,2583,2591,2592,2623,2624,2632,2647,2676,2679,2680,2686,2706,2716,2732,2762,2767,2778,2822,2857,2865,2883,2901,2907,2920,2923,2934,2935,2943,2963,2965,2981,2985,3002,3006,3016,3018,3024,3029,3039,3073,3099,3106,3107,3119,3126,3158,3179,3183,3188,3201,3216,3221,3222,3240,3254,3256,3259,3323,3344,3356,3359,3363,3365,3380,3391,3410,3418,3427,3443,3484,3519,3531,3550,3553,3560,3575,3592,3603,3618,3622,3627,3630,3638,3648,3660,3670,3676,3677,3698,3704,3707,3761 |
📝 DISCUSS RESULTS
…
We are going to analyze if there are pattern changes in the data distributions over time. To do this we will use the EHRtemporalVariability library (Delineating Temporal Dataset Shifts in Electronic Health Records). First, we change to basic type Date the case date, originally in text format:
repositoryFormatted <- EHRtemporalVariability::formatDate(
input = repository,
dateColumn = "BIRTH_DATE",
dateFormat = dateFormat
)
We obtain the temporal maps from the already formatted repository using the function estimateDataTemporalMap and selecting a monthly period. We can get the help of the function by typing ?estimateDataTemporalMap in the console (it is only necessary to pass the data, the column with the date, and the desired period, the rest is obtained automatically from the data).
probMaps <- estimateDataTemporalMap(
data = repositoryFormatted,
dateColumnName = "BIRTH_DATE",
period = "month"
)
Next we obtain the information geometry plots from the previously estimated temporal maps. To do this we will use the function estimateIGTProjection on the list of temporal maps. We are going to save the results in a variable.
igtProjs <- sapply( probMaps, estimateIGTProjection )
names( igtProjs ) <- names( probMaps )
We can observe as an example the data temporal map and information geometry temporal (IGT) plot of the variable type of anesthesia during delivery DEL_TYPE_ANE:
plotDataTemporalMap(probMaps$DEL_TYPE_ANE)
plotIGTProjection(igtProjs$DEL_TYPE_ANE, trajectory = TRUE)
In this example, we can see that over time there are changes or differences associated with synonymy of terms (No anesthesia, no, Without anesthesia), or even differences between upper and lower case (spinal, Spinal).
Next, we are going to save the results of the temporal variability estimation in a file, so that we can upload them to the web application EHRtemporalVariability, in the Load your RData tab.
save(probMaps, igtProjs, file = "variabilityData.RData")
Either using the web application, or obtaining the graphs in RStudio, we will study the temporal evolution of the type of delivery variable DEL_MODE, and discuss the results and their potential implications in the problem defined at the beginning of the practice on the monitoring of the early onset of lactation indicator, as well as propose a possible solution.
📝 DISCUSS RESULTS
…
The maternal and infant data studied have been kindly provided by Ricardo García de León for educational purposes only. Their use or distribution outside this subject and practice is forbidden.